Abstract

We present a model that generates free-form natural lan- 
guage descriptions of image regions. Our model leverages 
datasets of images and their sentence descriptions to learn 
about the inter-modal correspondences between text and vi- 
sual data. Our approach is based on a novel combination 
of Convolutional Neural Networks over image regions, bidi- 
rectional Recurrent Neural Networks over sentences, and a 
structured objective that aligns the two modalities through a 
multimodal embedding. We then describe a Recurrent Neu- 
ral Network architecture that uses the inferred alignments to 
learn to generate novel descriptions of image regions. We 
demonstrate the effectiveness of our alignment model with 
ranking experiments on Flickr8K, Flickr30K and COCO
datasets, where we substantially improve on the state of the 
art. We then show that the sentences created by our gen- 
erative model outperform retrieval baselines on the three 
aforementioned datasets and a new dataset of region-level 
annotations. 

1. Introduction 

A quick glance at an image is sufficient for a human to point 
out and describe an immense amount of details about the vi- 
sual scene [ ]. However, this remarkable ability has proven 
to be an elusive task for our visual recognition models. The 
majority of previous work in visual recognition has focused 
on labeling images with a fixed set of visual categories, and 
great progress has been achieved in these endeavors [36, 6]. 
However, while closed vocabularies of visual concepts con- 
stitute a convenient modeling assumption, they are vastly 
restrictive when compared to the enormous amount of rich 
descriptions that a human can compose. 

Some pioneering approaches that address the challenge of 
generating image descriptions have been developed [ 2, 7]. 
However, these models often rely on hard-coded visual con- 
cepts and sentence templates, which imposes limits on their 
variety. Moreover, the focus of these works has been on re- 
ducing complex visual scenes into a single sentence, which 
we consider an unnecessary restriction.

Figure 1. Our model generates free-form natural language descriptions
of image regions.



In this work, we strive to take a step towards the goal of 
generating dense, free-form descriptions of images (Figure 
1). The primary challenge towards this goal is in the de- 
sign of a model that is rich enough to reason simultaneously 
about contents of images and their representation in the do- 
main of natural language. Additionally, the model should 
be free of assumptions about specific hard-coded templates, 
rules or categories and instead rely primarily on training 
data. The second, practical challenge is that datasets of im- 
age captions are available in large quantities on the internet 
[14, 46, 29], but these descriptions multiplex mentions of 
several entities whose locations in the images are unknown. 

Our core insight is that we can leverage these large image- 
sentence datasets by treating the sentences as weak labels, 
in which contiguous segments of words correspond to some 
particular, but unknown location in the image. Our ap- 
proach is to infer these alignments and use them to learn 
a generative model of descriptions. Concretely, our contri- 
butions are twofold: 

• We develop a deep neural network model that infers
the latent alignment between segments of sentences
and the region of the image that they describe.
Our model associates the two modalities through a
common, multimodal embedding space and a struc- 
tured objective. We validate the effectiveness of this 
approach on image-sentence retrieval experiments in 
which we surpass the state-of-the-art. 

• We introduce a multimodal Recurrent Neural Network 
architecture that takes an input image and generates 
its description in text. Our experiments show that the 
generated sentences significantly outperform retrieval- 
based baselines, and produce sensible qualitative pre- 
dictions. We then train the model on the inferred cor- 
respondences and evaluate its performance on a new 
dataset of region-level annotations. 

We make our code, data and annotations publicly available. 

2. Related Work 

Dense image annotations. Our work shares the high-level 
goal of densely annotating the contents of images with 
many works before us. Barnard et al. [1] and Socher et 
al. [38] studied the multimodal correspondence between
words and images to annotate segments of images. Several 
works [26, 12, 9] studied the problem of holistic scene understanding
in which the scene type, objects and their spatial
support in the image are inferred. However, the focus of
these works is on correctly labeling scenes, objects and re- 
gions with a fixed set of categories, while our focus is on 
richer and higher-level descriptions of regions. 

Generating textual descriptions. Multiple works have ex- 
plored the goal of annotating images with textual descrip- 
tions on the scene level. A number of approaches pose 
the task as a retrieval problem, where the most compatible 
annotation in the training set is transferred to a test image 
[14, 39, 7, 34, 17], or where training annotations are broken 
up and stitched together [23, 27, 24]. However, these meth- 
ods rely on a large amount of training data to capture the 
variety in possible outputs, and are often expensive at test 
time due to their non-parametric nature. Several approaches 
have been explored for generating image captions based on 
fixed templates that are filled based on the content of the im- 
age [ ' 3, 22, 7, 43, 44, 4]. This approach still imposes limits 
on the variety of outputs, but the advantage is that the final 
results are more likely to be syntactically correct. Instead 
of using a fixed template, some approaches that use a gen- 
erative grammar have also been developed [33, 45]. More 
closely related to our approach is the work of Srivastava et 
al. [40] who use a Deep Boltzmann Machine to learn a joint 
distribution over images and tags. However, they do not
generate extended phrases. More recently, Kiros et al. [19] 
developed a log-bilinear model that can generate full sentence
descriptions. However, their model uses a fixed window
context, while our Recurrent Neural Network model
can condition the probability distribution over the next word 
in the sentence on all previously generated words. 

Grounding natural language in images. A number of ap- 
proaches have been developed for grounding textual data in 
the visual domain. Kong et al. [ ] develop a Markov Ran- 
dom Field that infers correspondences from parts of sen- 
tences to objects to improve visual scene parsing in RGBD 
images. Matuszek et al. [ ] learn a joint language and perception
model for grounded attribute learning in a robotic
setting. Zitnick et al. [48] reason about sentences and 
their grounding in cartoon scenes. Lin et al. [ ] retrieve 
videos from a sentence description using an intermediate 
graph representation. The basic form of our model is inspired
by Frome et al. [ ] who associate words and images
through a semantic embedding. More closely related is the 
work of Karpathy et al. [18], who decompose images and 
sentences into fragments and infer their inter-modal align- 
ment using a ranking objective. In contrast to their model 
which is based on grounding dependency tree relations, our 
model aligns contiguous segments of sentences which are 
more meaningful, interpretable, and not fixed in length. 

Neural networks in visual and language domains. Mul- 
tiple approaches have been developed for representing im- 
ages and words in higher-level representations. On the im- 
age side, Convolutional Neural Networks (CNNs) [ '5, 21] 
have recently emerged as a powerful class of models for 
image classification and object detection [ ]. On the sen- 
tence side, our work takes advantage of pretrained word 
vectors [32, 15, 2] to obtain low-dimensional representa- 
tions of words. Finally, Recurrent Neural Networks have 
been previously used in language modeling [31,41], but we 
additionally condition these models on images. 

3. Our Model

Overview. The ultimate goal of our model is to generate 
descriptions of image regions. During training, the input to 
our model is a set of images and their corresponding sen- 
tence descriptions (Figure 2). We first present a model that 
aligns segments of sentences to the visual regions that they 
describe through a multimodal embedding. We then treat 
these correspondences as training data for our multimodal 
Recurrent Neural Network model which learns to generate 
the descriptions. 

3.1. Learning to align visual and language data 

Our alignment model assumes an input dataset of images 
and their sentence descriptions. The key challenge to in- 
ferring the association between visual and textual data is 
that sentences written by people make multiple references 
to some particular, but unknown locations in the image. For 
example, in Figure 2, the words "Tabby cat is leaning" refer
to the cat, the words "wooden table" refer to the table, etc.
We would like to infer these latent correspondences, with
the goal of later learning to generate these snippets from
image regions. We build on the basic approach of Karpathy
et al. [ ], who learn to ground dependency tree relations
in sentences to image regions as part of a ranking
objective. Our contribution is in the use of a bidirectional
recurrent neural network to compute word representations
in the sentence, dispensing with the need to compute dependency
trees and allowing unbounded interactions of words
and their context in the sentence. We also substantially simplify
their objective and show that both modifications improve
ranking performance.

Figure 2. Overview of our approach. A dataset of images and their sentence descriptions is the input to our model (left). Our model first
infers the correspondences (middle) and then learns to generate novel descriptions (right).
[Panels: a training image with the sentence "A Tabby cat is leaning on a wooden table, with one paw on a laser mouse and the other on a black laptop"; inferred correspondences such as "Tabby cat is leaning", "laser mouse", "paw", "black laptop" and "wooden table"; generative model output on a test image such as "office telephone", "shiny laptop", "Tabby cat is sleeping", "wooden office desk" and "messy pile of documents".]

We first describe neural networks that map words and image 
regions into a common, multimodal embedding. Then we 
introduce our novel objective, which learns the embedding 
representations so that semantically similar concepts across 
the two modalities occupy nearby regions of the space.

3.1.1 Representing images 

Following prior work [22, 18], we observe that sentence 
descriptions make frequent references to objects and their 
attributes. Thus, we follow the method of Girshick et al. 
[ I ] to detect objects in every image with a Region Convo- 
lutional Neural Network (RCNN). The CNN is pre-trained 
on ImageNet [ ] and finetuned on the 200 classes of the 
ImageNet Detection Challenge [36]. To establish fair com- 
parisons to Karpathy et al. [ ], we use the top 19 detected 
locations and the whole image and compute the representations
based on the pixels $I_b$ inside each bounding box as
follows:

$$v = W_m [\mathrm{CNN}_{\theta_c}(I_b)] + b_m, \qquad (1)$$

where $\mathrm{CNN}(I_b)$ transforms the pixels inside bounding box
$I_b$ into 4096-dimensional activations of the fully connected
layer immediately before the classifier. The CNN parameters
$\theta_c$ contain approximately 60 million parameters and the
architecture closely follows the network of Krizhevsky et al.
[21]. The matrix $W_m$ has dimensions $h \times 4096$, where $h$ is
the size of the multimodal embedding space ($h$ ranges from
1000-1600 in our experiments). Every image is thus represented
as a set of $h$-dimensional vectors $\{v_i \mid i = 1 \ldots 20\}$.
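
As a rough illustration of the projection into the embedding space, the following Python sketch applies an affine map to stand-in region features. The function name `embed_regions`, the random features, the initialization scale and the bias vector `b_m` (included on the assumption that the projection is affine) are illustrative choices; in the actual pipeline the 4096-dimensional features come from the RCNN described above.

```python
import numpy as np

def embed_regions(cnn_features, W_m, b_m):
    """Project CNN activations (one row per region) into the embedding space.

    cnn_features: (num_regions, 4096), W_m: (h, 4096), b_m: (h,)
    """
    return cnn_features @ W_m.T + b_m

rng = np.random.default_rng(0)
h = 1000                                        # embedding size (hypothetical value)
features = rng.standard_normal((20, 4096))      # stand-in for 19 detections + whole image
W_m = rng.standard_normal((h, 4096)) * 0.01     # hypothetical initialization
b_m = np.zeros(h)
v = embed_regions(features, W_m, b_m)           # one h-dimensional vector per region
```

Each row of `v` plays the role of one region vector $v_i$.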



3.1.2 Representing sentences 

To establish the inter-modal relationships, we would like 
to represent the words in the sentence in the same h- 
dimensional embedding space that the image regions oc- 
cupy. The simplest approach might be to project every in- 
dividual word directly into this embedding. However, this 
approach does not consider any ordering and word context 
information in the sentence. An extension to this idea is 
to use word bigrams, or dependency tree relations as pre- 
viously proposed [18]. However, this still imposes an ar- 
bitrary maximum size of the context window and requires 
the use of Dependency Tree Parsers that might be trained on 
unrelated text corpora. 

To address these concerns, we propose to use a bidirectional 
recurrent neural network (BRNN) [ ] to compute the word 
representations. In our setting, the BRNN takes a sequence 
of $N$ words (encoded in a 1-of-k representation) and transforms
each one into an $h$-dimensional vector. However, the
representation of each word is enriched by a variably-sized
context around that word. Using the index $t = 1 \ldots N$ to
denote the position of a word in a sentence, the precise form
of the BRNN we use is as follows:

$$
\begin{aligned}
x_t &= W_w \mathbb{I}_t && (2) \\
e_t &= f(W_e x_t + b_e) && (3) \\
h_t^f &= f(e_t + W_f h_{t-1}^f + b_f) && (4) \\
h_t^b &= f(e_t + W_b h_{t+1}^b + b_b) && (5) \\
s_t &= f(W_d (h_t^f + h_t^b) + b_d). && (6)
\end{aligned}
$$

Here, $\mathbb{I}_t$ is an indicator column vector that is all zeros except
for a single one at the index of the $t$-th word in a word vocabulary.
The weights $W_w$ specify a word embedding matrix
that we initialize with 300-dimensional word2vec [ ]
weights and keep fixed in our experiments due to overfitting
concerns. Note that the BRNN consists of two independent
streams of processing, one moving left to right ($h_t^f$) and the
other right to left ($h_t^b$) (see Figure 3 for diagram). The final
$h$-dimensional representation $s_t$ for the $t$-th word is a
function of both the word at that location and also its surrounding
context in the sentence. Technically, every $s_t$ is a
function of all words in the entire sentence, but our empirical
finding is that the final word representations ($s_t$) align
most strongly to the visual concept of the word at that location
($\mathbb{I}_t$). Our hypothesis is that the strength of influence
diminishes with each step of processing since $s_t$ is a more
direct function of $\mathbb{I}_t$ than of the other words in the sentence.

Figure 3. Diagram for evaluating the image-sentence score $S_{kl}$.
Object regions are embedded with a CNN (left). Words (enriched
by their context) are embedded in the same multimodal space with
a BRNN (right). Pairwise similarities are computed with inner
products (magnitudes shown in grayscale) and finally reduced to
image-sentence score with Equation 8.

We learn the parameters $W_e$, $W_f$, $W_b$, $W_d$ and the respective
biases $b_e$, $b_f$, $b_b$, $b_d$. A typical size of the hidden representation
in our experiments ranges between 300-600 dimensions.
We set the activation function $f$ to the rectified
linear unit (ReLU), which computes $f : x \mapsto \max(0, x)$.
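
The bidirectional recurrence described in this section can be sketched in a few lines of NumPy. This is a minimal forward pass under stated assumptions, not the authors' implementation: the function name `brnn_forward`, the zero initial hidden states, and the matrix shapes noted in the comments are all illustrative.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def brnn_forward(x, W_e, b_e, W_f, b_f, W_b, b_b, W_d, b_d):
    """x: (N, d_word) word vectors for one sentence.

    Assumed shapes: W_e (hid, d_word), W_f and W_b (hid, hid), W_d (h, hid).
    Returns s: (N, h), one context-enriched vector per word.
    """
    N = x.shape[0]
    hid = W_f.shape[0]
    e = relu(x @ W_e.T + b_e)                 # per-word hidden input
    h_f = np.zeros((N, hid))
    h_b = np.zeros((N, hid))
    prev = np.zeros(hid)                      # assumed zero initial state
    for t in range(N):                        # left-to-right stream
        prev = relu(e[t] + W_f @ prev + b_f)
        h_f[t] = prev
    nxt = np.zeros(hid)                       # assumed zero initial state
    for t in reversed(range(N)):              # right-to-left stream
        nxt = relu(e[t] + W_b @ nxt + b_b)
        h_b[t] = nxt
    return relu((h_f + h_b) @ W_d.T + b_d)    # combine both streams
```

Because each stream carries its state across the whole sentence, every output row depends on all words, with nearby words typically contributing most.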

3.1.3 Alignment objective 

We have described the transformations that map every im- 
age and sentence into a set of vectors in a common h- 
dimensional space. Since our labels are at the level of en- 
tire images and sentences, our strategy is to formulate an 
image-sentence score as a function of the individual scores 
that measure how well a word aligns to a region of an im- 
age. Intuitively, a sentence-image pair should have a high 
matching score if its words have a confident support in the 
image. Karpathy et al. [ ] interpreted the dot
product $v_i^T s_t$ between an image fragment $i$ and a sentence
fragment $t$ as a measure of similarity and used these to define
the score between image $k$ and sentence $l$ as:

$$S_{kl} = \sum_{t \in g_l} \sum_{i \in g_k} \max(0, v_i^T s_t). \qquad (7)$$

Here, $g_k$ is the set of image fragments in image $k$ and $g_l$
is the set of sentence fragments in sentence $l$. The indices
$k, l$ range over the images and sentences in the training set.
Together with their additional Multiple Instance Learning
objective, this score carries the interpretation that a sentence
fragment aligns to a subset of the image regions whenever
the dot product is positive. We found that the following
reformulation simplifies the model and alleviates the need
for additional objectives and their hyperparameters:

$$S_{kl} = \sum_{t \in g_l} \max_{i \in g_k} v_i^T s_t. \qquad (8)$$

Here, every word $s_t$ aligns to the single best image region.
As we show in the experiments, this simplified model also
leads to improvements in the final ranking performance.
Assuming that $k = l$ denotes a corresponding image and
sentence pair, the final max-margin, structured loss remains:

$$
\mathcal{C}(\theta) = \sum_k \Big[ \underbrace{\sum_l \max(0, S_{kl} - S_{kk} + 1)}_{\text{rank images}} + \underbrace{\sum_l \max(0, S_{lk} - S_{kk} + 1)}_{\text{rank sentences}} \Big]. \qquad (9)
$$

This objective encourages aligned image-sentence pairs to
have a higher score than misaligned pairs, by a margin. 
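
To make the simplified score and the max-margin objective concrete, here is a small NumPy sketch. The function names are hypothetical, and excluding the diagonal (matching) pairs from the hinge terms is an assumption made for readability; including them would only add a constant to the loss.

```python
import numpy as np

def image_sentence_score(V, S):
    """V: (num_regions, h) region vectors, S: (num_words, h) word vectors.

    Every word is matched to its single best-scoring region and the
    per-word maxima are summed.
    """
    sims = S @ V.T                      # sims[t, i] = v_i . s_t
    return sims.max(axis=1).sum()

def ranking_loss(scores, margin=1.0):
    """scores[k, l] is the score of image k with sentence l.

    Matching pairs are assumed to lie on the diagonal (k == l).
    """
    diag = np.diag(scores)
    # Hinge on S_kl - S_kk: the true pair for image k should beat sentence l.
    cost_a = np.maximum(0.0, scores - diag[:, None] + margin)
    # Hinge on S_lk - S_kk: the true pair for sentence k should beat image l.
    cost_b = np.maximum(0.0, scores - diag[None, :] + margin)
    np.fill_diagonal(cost_a, 0.0)       # assumption: skip the matching pair itself
    np.fill_diagonal(cost_b, 0.0)
    return cost_a.sum() + cost_b.sum()
```

In a training loop, `scores` would be filled by calling `image_sentence_score` for every image-sentence pair in a mini-batch.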

3.1.4 Decoding text segment alignments to images 

Consider an image from the training set and its corresponding
sentence. We can interpret the quantity $v_i^T s_t$ as the unnormalized
log probability of the $t$-th word describing any
of the bounding boxes in the image. However, since we are
ultimately interested in generating snippets of text instead 
of single words, we would like to align extended, contigu- 
ous sequences of words to a single bounding box. Note that 
the naive solution that assigns each word independently to 
the highest-scoring region is insufficient because it leads to 
words getting scattered inconsistently to different regions. 

To address this issue, we treat the true alignments as latent 
variables in a Markov Random Field (MRF) where the bi- 
nary interactions between neighboring words encourage an 
alignment to the same region. Concretely, given a sentence 
with N words and an image with M bounding boxes, we 
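introduce the latent alignment variables $a_j \in \{1 \ldots M\}$ for
$j = 1 \ldots N$ and formulate an MRF in a chain structure
along the sentence as follows:

$$
\begin{aligned}
E(\mathbf{a}) &= \sum_{j=1 \ldots N} \psi_j^U(a_j) + \sum_{j=1 \ldots N-1} \psi_j^B(a_j, a_{j+1}) && (10) \\
\psi_j^U(a_j = t) &= v_i^T s_t && (11) \\
\psi_j^B(a_j, a_{j+1}) &= \beta \, \mathbb{1}[a_j = a_{j+1}]. && (12)
\end{aligned}
$$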

Here, $\beta$ is a hyperparameter that controls the affinity towards
longer word phrases. This parameter allows us to
interpolate between single-word alignments ($\beta = 0$) and
aligning the entire sentence to a single, maximally scoring
region when $\beta$ is large. We minimize the energy to find the
best alignments $\mathbf{a}$ using dynamic programming. The output
of this process is a set of image regions annotated with segments
of text. We now describe an approach for generating
novel phrases based on these correspondences.

Figure 4. Diagram of our multimodal Recurrent Neural Network
generative model. The RNN takes an image, a word, the context
from previous time steps and defines a distribution over the next
word. START and END are special tokens.
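
The chain-structured decoding of Section 3.1.4 can be sketched as a Viterbi-style dynamic program. The sketch below treats the sum of word-region similarities plus the agreement bonus as a score to be maximized, which we assume is the intended reading (equivalently, minimizing the negated score); the function name `decode_alignment` and the input layout are illustrative.

```python
import numpy as np

def decode_alignment(sims, beta):
    """sims[j, m]: similarity of word j with region m, shape (N, M).

    Returns an integer array with the chosen region for each word,
    maximizing sum_j sims[j, a_j] + beta * sum_j [a_j == a_{j+1}].
    """
    N, M = sims.shape
    best = sims[0].copy()                   # best chain score ending in each region
    back = np.zeros((N, M), dtype=int)      # backpointers
    for j in range(1, N):
        # trans[p, m]: score of word j-1 in region p followed by word j in region m
        trans = best[:, None] + beta * np.eye(M)
        back[j] = trans.argmax(axis=0)
        best = trans.max(axis=0) + sims[j]
    a = np.zeros(N, dtype=int)
    a[-1] = best.argmax()
    for j in range(N - 1, 0, -1):           # trace the best chain backwards
        a[j - 1] = back[j, a[j]]
    return a
```

With `beta = 0` each word independently picks its best region; as `beta` grows, neighboring words are pulled into the same region, which yields contiguous text snippets per bounding box. The running time is O(N M^2) for N words and M regions.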

3.2. Multimodal Recurrent Neural Network for 
generating descriptions 

In this section we assume an input set of images and their 
textual descriptions. These could be full images and their 
sentence descriptions, or regions and text snippets as dis- 
cussed in previous sections. The key challenge is in the de- 
sign of a model that can predict a variable-sized sequence 
of outputs. In previously developed language models based 
on Recurrent Neural Networks (RNNs) [ , , ], this is 
achieved by defining a probability distribution of the next 
word in a sequence, given the current word and context from 
previous time steps. We explore a simple but effective ex- 
tension that additionally conditions the generative process 
on the content of an input image. More formally, the RNN
takes the image pixels $I$ and a sequence of input vectors
$(x_1, \ldots, x_T)$. It then computes a sequence of hidden states
$(h_1, \ldots, h_T)$ and a sequence of outputs $(y_1, \ldots, y_T)$ by iterating
the following recurrence relation for $t = 1$ to $T$:

$$
\begin{aligned}
b_v &= W_{hi} [\mathrm{CNN}_{\theta_c}(I)] && (13) \\
h_t &= f(W_{hx} x_t + W_{hh} h_{t-1} + b_h + b_v) && (14) \\
y_t &= \mathrm{softmax}(W_{oh} h_t + b_o). && (15)
\end{aligned}
$$

In the equations above, $W_{hi}$, $W_{hx}$, $W_{hh}$, $W_{oh}$ and $b_h$, $b_o$ are
a set of learnable weights and biases. The output vector $y_t$
has the size of the word dictionary and one additional dimension
for a special END token that terminates the generative
process. Note that we provide the image context vector
$b_v$ to the RNN at every iteration so that it does not have to
remember the image content while generating words.

RNN training. The RNN is trained to combine a word ($x_t$),
the previous context ($h_{t-1}$) and the image information ($b_v$)
to predict the next word ($y_t$). Concretely, the training proceeds
as follows (refer to Figure 4): We set $h_0 = \vec{0}$, $x_1$ to
a special START vector, and the desired label $y_1$ as the first
word in the sequence. In particular, we use the word embedding
for "the" as the START vector $x_1$. Analogously,
we set $x_2$ to the word vector of the first word and expect the
network to predict the second word, etc. Finally, on the last
step when $x_T$ represents the last word, the target label is set
to a special END token. The cost function is to maximize
the log probability assigned to the target labels.

RNN at test time. The RNN predicts a sentence as follows:
We compute the representation of the image $b_v$, set $h_0 = 0$,
$x_1$ to the embedding of the word "the", and compute the
distribution over the first word $y_1$. We sample from the distribution
(or pick the argmax), set its embedding vector as
$x_2$, and repeat this process until the END token is generated.
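
The test-time procedure can be sketched as a greedy decoding loop. This is a minimal illustration under assumptions, not the released code: the function name `generate`, the parameter packing, the use of ReLU for the hidden nonlinearity, and the `max_len` safety cap are all hypothetical, and the image context vector is assumed to be precomputed.

```python
import numpy as np

def softmax(z):
    z = z - z.max()                      # shift for numerical stability
    p = np.exp(z)
    return p / p.sum()

def generate(b_v, embed, params, start_id, end_id, max_len=20):
    """Greedy (argmax) decoding of a word-index sequence.

    b_v: image context vector, shape (hid,)
    embed: (vocab, d) word embedding table; start_id indexes the START word
    params: (W_hx, W_hh, b_h, W_oh, b_o); outputs have vocab + 1 entries,
            where end_id is the index of the END token.
    """
    W_hx, W_hh, b_h, W_oh, b_o = params
    h = np.zeros(W_hh.shape[0])          # initial hidden state set to zero
    x = embed[start_id]
    words = []
    for _ in range(max_len):             # cap is an added safeguard
        # The image context is added at every step.
        h = np.maximum(0.0, W_hx @ x + W_hh @ h + b_h + b_v)
        y = softmax(W_oh @ h + b_o)      # distribution over next word or END
        w = int(y.argmax())
        if w == end_id:
            break
        words.append(w)
        x = embed[w]                     # feed the chosen word back in
    return words
```

Replacing `argmax` with sampling from `y` gives the stochastic variant mentioned in the text.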

3.3. Optimization 

We use Stochastic Gradient Descent with mini-batches of 
100 image-sentence pairs and momentum of 0.9 to optimize 
the alignment model. We cross-validate the learning rate 
and the weight decay. We also use dropout regularization in 
all layers except in the recurrent layers [ ]. The generative 
RNN is more difficult to optimize, partly due to the word
frequency disparity between rare words and very common
words (such as the END token). We achieved the best re- 
sults using RMSprop [ ], which is an adaptive step size 
method that scales the gradient of each weight by a running 
average of its gradient magnitudes. 
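
For reference, one common form of the RMSprop update is sketched below. The hyperparameter values (`lr`, `decay`, `eps`) are placeholder assumptions, not the settings used in our experiments, and the function name is illustrative.

```python
import numpy as np

def rmsprop_update(w, grad, cache, lr=1e-3, decay=0.99, eps=1e-8):
    """One RMSprop step for a weight array.

    cache holds the running average of squared gradients; each gradient
    component is divided by the root of this average before the step.
    """
    cache = decay * cache + (1.0 - decay) * grad ** 2
    w = w - lr * grad / (np.sqrt(cache) + eps)
    return w, cache
```

Weights that rarely receive large gradients (for example, those tied to rare words) keep a small running average and therefore take relatively larger steps.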

4. Experiments 

Datasets. We use the Flickr8K [14], Flickr30K [46] and
COCO [ ] datasets in our experiments. These datasets
contain 8,000, 31,000 and 123,000 images respectively
and each is annotated with 5 sentences using Amazon
Mechanical Turk. For Flickr8K and Flickr30K, we use
1,000 images for validation, 1,000 for testing and the rest 
for training (consistent with [14, 18]). For COCO we use 
5,000 images for both validation and testing. 

Data Preprocessing. We convert all sentences to lower- 
case, discard non-alphanumeric characters, and filter out 
the articles "an", "a", and "the" for efficiency. Our word 
vocabulary contains 20,000 words. 
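
The preprocessing steps can be sketched as follows. The function name `preprocess`, the restriction to ASCII letters and digits, and whitespace tokenization are assumptions made for illustration.

```python
import re

ARTICLES = {"a", "an", "the"}

def preprocess(sentence):
    """Lowercase, drop non-alphanumeric characters, and filter articles."""
    s = sentence.lower()
    s = re.sub(r"[^a-z0-9\s]", "", s)    # keep letters, digits and whitespace
    return [w for w in s.split() if w not in ARTICLES]

tokens = preprocess("A Tabby cat is leaning on a wooden table, with one paw on a laser mouse")
```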

4.1. Image-Sentence Alignment Evaluation 

We first investigate the quality of the inferred text and im- 
age alignments. As a proxy for this evaluation we perform 
ranking experiments where we consider a withheld set of 
images and sentences and then retrieve items in one modality
given a query from the other. We use the image-sentence
score $S_{kl}$ (Section 3.1.3) to evaluate a compatibility score
between all pairs of test images and sentences. We then report
the median rank of the closest ground truth result in the
list and Recall@K, which measures the fraction of times a
correct item was found among the top K results. The results
of these experiments can be found in Table 1, and example
retrievals in Figure 5. We now highlight some of the
takeaways.

Table 1. Image-Sentence ranking experiment results. R@K is Recall@K (high is good). Med r is the median rank (low is good). In the
results for our models, we take the top 5 validation set models, evaluate each independently on the test set and then report the average
performance. The standard deviations on the recall values range from approximately 0.5 to 1.0.
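
The two ranking metrics can be computed from a matrix of compatibility scores as in the sketch below. It makes a simplifying assumption that each query has exactly one correct candidate, located at the same index as the query; with several ground-truth sentences per image one would instead take the best rank among them. The function name and the default values of K are illustrative.

```python
import numpy as np

def ranking_metrics(scores, ks=(1, 5, 10)):
    """scores[q, c]: compatibility of query q with candidate c.

    Returns ({K: Recall@K}, median rank), with ranks starting at 1.
    """
    order = np.argsort(-scores, axis=1)                 # best candidate first
    truth = np.arange(len(scores))[:, None]             # correct candidate per query
    ranks = np.argmax(order == truth, axis=1) + 1       # position of the correct item
    recall = {k: float(np.mean(ranks <= k)) for k in ks}
    return recall, float(np.median(ranks))
```

Calling the function once on the score matrix and once on its transpose covers both retrieval directions.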

Our full model outperforms previous work. We compare 
our full model ("Our model: BRNN") to the following base- 
lines: DeViSE [ ] is a model that learns a score between 
words and images. As the simplest extension to the setting 
of multiple image regions and multiple words, Karpathy et 
al. [Ir ] averaged the word and image region representa- 
tions to obtain a single vector for each modality. Socher et 
al. [3' ] is trained with a similar objective, but instead of 
averaging the word representations, they merge word vec- 
tors into a single sentence vector with a Recursive Neural 
Network. DeFrag are the results reported by Karpathy et
al. [" ]. Since we use different word vectors, dropout for 
regularization and different cross-validation ranges (includ- 
ing larger embedding sizes), we re-implemented their cost 
function for a fair comparison ("Our implementation of De- 
Frag"). In all of these cases, our full model ("Our model: 
BRNN") provides consistent improvements. 

Our simpler cost function improves performance. We
now try to understand the sources of these improvements.
First, we removed the BRNN and used dependency tree relations
exactly as described in Karpathy et al. [ ] ("Our
model: DepTree edges"). The only difference between this 
model and "Our reimplementation of DeFrag" is the new, 
simpler cost function introduced in Section 3.1.3. We see 
that our formulation shows consistent improvements. 



BRNN outperforms dependency tree relations. Further- 
more, when we replace the dependency tree relations with 
the BRNN, we observe additional performance improve- 
ments. Since the dependency relations were shown to work 
better than single words and bigrams [18], this suggests that 
the BRNN is taking advantage of contexts longer than two 
words. Furthermore, our method does not rely on extracting 
a Dependency Tree and instead uses the raw words directly. 

COCO results for future comparisons. The COCO 
dataset has only recently been released, and we are not 
aware of other published ranking results. Therefore, we re- 
port results on a subset of 1,000 images and the full set of 
5,000 test images for future comparisons. 

Qualitative. As can be seen from example groundings in 
Figure 5, the model discovers interpretable visual-semantic 
correspondences, even for small or relatively rare objects 
such as "seagulls" and "accordion". These details would
be missed by models that only reason about full images. 

4.2. Evaluation of Generated Descriptions 

We have demonstrated that our alignment model produces 
state of the art ranking results and qualitative experiments 
suggest that the model effectively infers the alignment be- 
tween words and image regions. Our task is now to synthe- 
size these sentence snippets given new image regions. We 
evaluate these predictions with the BLEU [ ] score, which 
despite multiple problems [14, 22] is still considered to be 
the standard metric of evaluation in this setting. The BLEU 
score evaluates a candidate sentence by measuring the frac- 
tion of n-grams that appear in a set of references. 
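
The n-gram matching at the core of BLEU can be sketched as a clipped n-gram precision. This is only one ingredient of the metric: the full score also combines several n-gram orders and applies a brevity penalty, both omitted here. The function names are illustrative.

```python
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def ngram_precision(candidate, references, n):
    """Fraction of candidate n-grams found in the references.

    Each candidate n-gram is credited at most as many times as it occurs
    in the single reference that contains it most often (clipping).
    """
    cand = ngrams(candidate, n)
    if not cand:
        return 0.0
    max_ref = Counter()
    for ref in references:
        for gram, count in ngrams(ref, n).items():
            max_ref[gram] = max(max_ref[gram], count)
    matched = sum(min(count, max_ref[gram]) for gram, count in cand.items())
    return matched / sum(cand.values())
```

Clipping prevents a candidate from scoring well by repeating a single frequent word.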



Figure 5. Example alignments predicted by our model. For every test image above, we retrieve the most compatible test sentence and
visualize the highest-scoring region for each word (before MRF smoothing described in Section 3.1.4) and the associated scores ($v_i^T s_t$).
We hide the alignments of low-scoring words to reduce clutter. We assign each region an arbitrary color.

                            |     Flickr8K     |    Flickr30K     |       COCO
Method of generating text   | B-1   B-2   B-3  | B-1   B-2   B-3  | B-1   B-2   B-3
Human agreement             | 0.59  0.35  0.16 | 0.64  0.36  0.16 | 0.57  0.31  0.13
Ranking: Nearest Neighbor   | 0.29  0.11  0.03 | 0.27  0.08  0.02 | 0.32  0.11  0.03
Generating: RNN             | 0.42  0.19  0.06 | 0.45  0.20  0.06 | 0.50  0.25  0.12

Table 2. BLEU score evaluation of full image predictions on 1,000 images. B-n is BLEU score that uses up to n-grams (high is good).



Our multimodal RNN outperforms retrieval baseline. 

We first verify that our multimodal RNN is rich enough to 
support sentence generation for full images. In this experiment,
we trained the RNN to generate sentences on full images
from the Flickr8K, Flickr30K, and COCO datasets. Then
at test time, we use the first four out of five sentences as 
references and the fifth one to evaluate human agreement. 
We also compare to a ranking baseline which uses the best 
model from the previous section (Section 4.1) to annotate 
each test image with the highest-scoring sentence from the 
training set. The quantitative results of this experiment are 
in Table 2. Note that the RNN model confidently outperforms
the retrieval method. This result is especially interesting
on the COCO dataset, since its training set consists of more
than 600,000 sentences that cover a large variety of de- 
scriptions. Additionally, compared to the retrieval baseline 
which compares each image to all sentences in the training 
set, the RNN takes a fraction of a second to evaluate. 

We show example fullframe predictions in Figure 6. Our 
generative model (shown in blue) produces sensible de- 
scriptions, even in the last two images that we consider to 
be failure cases. Additionally, we verified that none of these 
sentences appear in the training set. This suggests that the 
model is not simply memorizing the training data. However,
there are 20 occurrences of "man in black shirt" and
60 occurrences of "is playing guitar", which the model may
have composed to describe the first image.

Region-level evaluation. Finally, we evaluate our region 
RNN which was trained on the inferred, intermodal corre- 
spondences. To support this evaluation, we collected a new 
dataset of region-level annotations. Concretely, we asked 8 
people to label a subset of COCO test images with region- 
level text descriptions. The labeling interface consisted of 
a single test image, and the ability to draw a bounding box 
and annotate it with text. We provided minimal constraints 
and instructions, except to "describe the content of each 
box" and we encouraged the annotators to describe a large 
variety of objects, actions, stuff, and high-level concepts. 
The final dataset consists of 1469 annotations in 237 im- 
ages. There are on average 6.2 annotations per image, and 
each one is on average 4.13 words long. 

We compare three models on this dataset: The region RNN
model, a fullframe RNN model that was trained on full images
and sentences, and a ranking baseline. To predict descriptions
with the ranking baseline, we take the number
of words in the shortest reference annotation and search the
training set sentences for the highest scoring segment of text
of that length. This ensures that the ranking baseline does
not accumulate any brevity penalty in its BLEU scores.

Figure 6. Example fullframe predictions. Green: human annotation. Red: Most compatible sentence in the training set (i.e. ranking
baseline). Blue: Generated sentence using the fullframe multimodal RNN. We provide more examples in the supplementary material.

Figure 7. Example region predictions. We use our region-level multimodal RNN to generate text (shown on the right of each image) for
some of the bounding boxes in each image. The lines are grounded to centers of bounding boxes and the colors are chosen arbitrarily.

Method of generating text       | B-1  | B-2  | B-3
Human agreement                 | 0.54 | 0.33 | 0.16
Ranking: Nearest Neighbor       | 0.14 | 0.03 | 0.07
Generating: Full frame model    | 0.12 | 0.03 | 0.01
Generating: Region level model  | 0.17 | 0.05 | 0.01

Table 3. BLEU score evaluation of image region annotations.

We report the results in Table 3, and show example pre- 
dictions in Figure 7. To reiterate the difficulty of the task, 
consider that the phrase "table with wine glasses" that is 
generated on the middle image in Figure 7 only occurs in 
the training set 30 times. Each time it may have a different 
appearance and each time it may occupy a few (or none) 
of the bounding boxes. To generate this string for the im- 
age, the model had to correctly infer the correspondence and 
then learn to generate this string. 

There are several takeaways from Table 3. First, the human
agreement baseline displays stronger performance relative
to our RNN models on the region-level task than the
full image task. Additionally, the performance of the rank- 
ing baseline is now competitive with the RNN model. One 
possible explanation is that the snippets of text are shorter 
in this dataset, which makes it easier to find a good match 
in the training sentences. We believe that these results are 
an encouraging first step towards the task of dense scene 
descriptions, and we release our annotations so that future 
work can compare to these results. 



4.3. Limitations 

Although our results are encouraging, the RNN model is 
subject to multiple limitations. First, the model can only 
generate a description of one input array of pixels at a fixed 
resolution. A more sensible approach might be to use mul- 
tiple saccades around the image to identify all entities, their 
mutual interactions and wider context before generating a 
description. Additionally, the RNN (as formulated in Equa- 
tion 13) couples the visual and language domains in the hid- 
den representation only through additive interactions, which 
are known to be less expressive than more complicated mul- 
tiplicative interactions [41]. Lastly, going directly from an 
image-sentence dataset to region-level annotations as part 
of a single model that is trained end-to-end with a single 
objective remains an open problem. 

5. Conclusions 

We introduced a model that generates free-form descriptions
of image regions based on weak labels in the form of a
dataset of images and sentences, and with very few hard- 
coded assumptions. Our approach relied on a novel struc- 
tured objective that aligned the visual and textual modalities 
through a common, multimodal embedding. We showed 
that this approach leads to consistent state of the art per- 
formance on ranking experiments across three datasets. We 
then described a multimodal Recurrent Neural Network ar- 
chitecture that generates textual descriptions based on im- 
age regions, and evaluated its performance with fullframe 
and region-level experiments. We showed that in both cases 
the multimodal RNN outperforms retrieval baselines. 